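Graph neural networks (GNNs) have emerged as an effective method for machine learning tasks and bring a new approach to building recommender systems, in which the recommendation task can be formulated as a user-item link prediction problem. Training GNN-based recommender systems (GNNRecSys) on large graphs incurs a large memory footprint that easily exceeds the DRAM capacity of a typical server. Existing solutions resort to distributed subgraph training, which is inefficient because of the high cost of dynamically constructing subgraphs and the significant redundancy across subgraphs. The emerging Intel Optane persistent memory allows a single machine to have up to 6 TB of memory at an affordable cost, which makes single-machine GNNRecSys training feasible and eliminates the inefficiencies of distributed training. One major concern with using Optane for GNNRecSys is its relatively low bandwidth compared with DRAM. This limitation can be particularly detrimental to the performance of GNNRecSys workloads, since their dominant compute kernels are sparse and memory-access intensive. To understand whether Optane is a good fit for GNNRecSys training, we perform an in-depth characterization of GNNRecSys workloads and a comprehensive benchmarking study. Our benchmarking results show that, when properly configured, Optane-based single-machine GNNRecSys training outperforms distributed training by a large margin, especially when handling deep GNN models. We analyze where the speedup comes from, provide guidance on how to configure Optane for GNNRecSys workloads, and discuss opportunities for further optimization.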
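To make the "recommendation as user-item link prediction" framing concrete, here is a minimal sketch, assuming one round of mean aggregation over the user-item graph followed by a dot-product score. The function names (`mean_aggregate`, `link_scores`) and the residual-style update are illustrative assumptions, not the architecture benchmarked in the abstract. A dense adjacency matrix is used only for brevity; at scale the `adj @ feats` product would be a sparse matrix multiplication, which is the kind of sparse, memory-access-intensive kernel the abstract refers to.

```python
import numpy as np

def mean_aggregate(adj, feats):
    """Average the features of each destination node's neighbors.

    adj:   (n_dst, n_src) 0/1 adjacency matrix (dense here for brevity).
    feats: (n_src, d) source-node features.
    """
    deg = adj.sum(axis=1, keepdims=True)
    # Guard against isolated nodes (degree 0) to avoid division by zero.
    return (adj @ feats) / np.maximum(deg, 1)

def link_scores(adj_ui, user_emb, item_emb):
    """Score every (user, item) pair after one round of message passing."""
    # Users gather information from the items they interacted with, and vice versa.
    user_h = user_emb + mean_aggregate(adj_ui, item_emb)
    item_h = item_emb + mean_aggregate(adj_ui.T, user_emb)
    # Link prediction: a higher dot product means a more likely user-item edge.
    return user_h @ item_h.T

# Toy example: two users, two items, each user interacted with one item.
adj_ui = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
scores = link_scores(adj_ui, np.eye(2), np.eye(2))
```

In this toy case each user scores highest on the item it already interacted with; a trained model would learn the embeddings so that held-out edges also score highly.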
Translated by Google Translate
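With video-level labels, weakly supervised temporal action localization (WTAL) applies a localization-by-classification paradigm to detect and classify actions in untrimmed videos. Because of the nature of classification, class-specific background snippets are inevitably mis-activated to improve the discriminability of the classifier in WTAL. To alleviate the interference of background, existing methods try to enlarge the discrepancy between action and background by modeling background snippets with pseudo snippet-level annotations, which relies heavily on artificial assumptions. Unlike previous works, we propose an adversarial learning strategy to break the limitation of mining pseudo background snippets. Specifically, the background classification loss forces the whole video to be regarded as background through a background gradient reinforcement strategy, which confuses the recognition model. Conversely, the foreground (action) loss guides the model to focus on action snippets under such conditions. As a result, the competition between the two classification losses drives the model to improve its action modeling ability. In addition, a novel temporal enhancement network is designed to help the model build temporal relations among affinity snippets based on the proposed strategy, further improving action localization performance. Finally, extensive experiments on THUMOS14 and ActivityNet1.2 demonstrate the effectiveness of the proposed method.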
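The hand-eye calibration problem is an important application problem in robotics research. Based on the 2-norm of dual quaternion vectors, we propose a new dual quaternion optimization method for the hand-eye calibration problem. The dual quaternion optimization problem is decomposed into two quaternion optimization subproblems. The first quaternion optimization subproblem governs the rotation of the robot hand. It can be solved efficiently by eigenvalue decomposition or singular value decomposition. If the optimal value of the first quaternion optimization subproblem is zero, then the system is rotationwise noiseless, i.e., there exists a "perfect" robot hand motion that meets all the tested poses exactly in rotation. In this case, we apply a regularization technique to solve the second subproblem and minimize the distance of the translation. Otherwise, we apply a patching technique to the second quaternion optimization subproblem. Solving the second quaternion optimization subproblem then amounts to solving a quadratically constrained quadratic program. In this way, we give a complete description of the solution set of the hand-eye calibration problem. This is new in the hand-eye calibration literature. Numerical results are also presented to show the efficiency of the proposed method.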
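As an illustration of how a rotation subproblem can be solved by eigenvalue decomposition, the sketch below uses the classical quaternion form of the hand-eye rotation constraint, a_i ⊗ q = q ⊗ b_i, and takes the unit eigenvector of the smallest eigenvalue of the stacked constraint matrix. This is an assumed textbook formulation rather than the dual quaternion 2-norm model of the abstract; all function names and the synthetic data are hypothetical. An optimal value of zero corresponds to the rotationwise noiseless case.

```python
import numpy as np

def left_mat(p):
    """Matrix L(p) with p ⊗ q = L(p) @ q, quaternions stored as [w, x, y, z]."""
    w, x, y, z = p
    return np.array([[w, -x, -y, -z],
                     [x,  w, -z,  y],
                     [y,  z,  w, -x],
                     [z, -y,  x,  w]])

def right_mat(p):
    """Matrix R(p) with q ⊗ p = R(p) @ q."""
    w, x, y, z = p
    return np.array([[w, -x, -y, -z],
                     [x,  w,  z, -y],
                     [y, -z,  w,  x],
                     [z,  y, -x,  w]])

def quat_mul(p, q):
    return left_mat(p) @ q

def quat_conj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def solve_rotation(a_quats, b_quats):
    """Minimize sum_i ||a_i ⊗ q - q ⊗ b_i||^2 subject to ||q|| = 1."""
    m = np.zeros((4, 4))
    for a, b in zip(a_quats, b_quats):
        c = left_mat(a) - right_mat(b)
        m += c.T @ c
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(m)
    return eigvecs[:, 0], eigvals[0]

# Synthetic noiseless data: a_i = q_true ⊗ b_i ⊗ conj(q_true).
rng = np.random.default_rng(0)
q_true = rng.normal(size=4)
q_true /= np.linalg.norm(q_true)
b_quats = [v / np.linalg.norm(v) for v in rng.normal(size=(3, 4))]
a_quats = [quat_mul(quat_mul(q_true, b), quat_conj(q_true)) for b in b_quats]

q_est, residual = solve_rotation(a_quats, b_quats)
```

The recovered `q_est` matches `q_true` only up to sign, since q and -q represent the same rotation. With noisy measurements the smallest eigenvalue would be strictly positive.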
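Compared with human vision, computer vision based on convolutional neural networks (CNNs) is more vulnerable to adversarial noise. This difference is likely attributable to how the eyes sample visual input and how the brain processes retinal samples through its dorsal and ventral visual pathways, which remain under-explored in computer vision. Inspired by the brain, we design recurrent neural networks that include an input sampler mimicking the human retina, a dorsal network that guides where to look next, and a ventral network that represents the retinal samples. Combining these modules, the models learn to take multiple glances at an image, attend to a salient part at each glance, and accumulate the representation over time to recognize the image. We test the robustness of such models against varying levels of adversarial noise, with a special focus on the effect of different input sampling strategies. Our findings suggest that retinal foveation and sampling make a model more robust, and that the model may correct itself after an attack when it is given more time to take more glances at an image. In conclusion, robust visual recognition can benefit from the combined use of three brain-inspired mechanisms: retinal transformation, attention-guided eye movement, and recurrent processing, as opposed to feedforward-only CNNs.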
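In natural language processing, most models try to learn semantic representations from text alone. The learned representations encode distributional semantics but fail to connect to any knowledge about the physical world. In contrast, humans learn language by grounding concepts in perception and action, and the brain encodes grounded semantics for cognition. Inspired by this notion and by recent work in vision-language learning, we design a two-stream model for grounding language learning in vision. The model includes a VGG-based visual stream and a BERT-based language stream. The two streams merge into a joint representational space. Through cross-modal contrastive learning, the model first learns to align visual and language representations with the MS COCO dataset. The model further learns to retrieve visual objects with language queries through a cross-modal attention module, and to infer the visual relations between the retrieved objects through a bilinear operator with the Visual Genome dataset. After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space. This semantic space exhibits principal dimensions that can be explained with human intuition and neurobiological knowledge. Word embeddings in this semantic space are predictive of human-defined norms of semantic features and are segregated into perceptually distinctive clusters. Furthermore, the visually grounded language model also enables compositional language understanding based on visual knowledge, as well as multimodal image search with queries based on images, text, or their combination.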
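The cross-modal contrastive alignment step can be illustrated with a short sketch. The abstract does not state the exact loss, so this assumes a symmetric InfoNCE-style objective over a batch of matched image and text embeddings; the function name `contrastive_loss` and the temperature value are hypothetical choices.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(x, axis):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss; row i of each input is a matched pair."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature      # cosine similarities, scaled
    idx = np.arange(len(img))
    # Image-to-text: each image should pick out its own caption (row-wise softmax).
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each caption should pick out its own image (column-wise softmax).
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Perfectly aligned pairs versus deliberately mismatched pairs.
aligned = contrastive_loss(np.eye(4), np.eye(4))
mismatched = contrastive_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

Here `aligned` is close to zero while `mismatched` is large, which is the behavior that pulls matched image and text representations together in the joint space.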
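Amid the pandemic of 2019 novel coronavirus disease (COVID-19), caused by SARS-CoV-2, a large amount of drug research for prevention and treatment has been conducted rapidly, but these efforts have so far been unsuccessful. Our objective is to prioritize repurposable drugs using a drug repurposing pipeline that systematically integrates multiple SARS-CoV-2 and drug interactions, deep graph neural networks, and in vitro/population-based validation. We first collected all available drugs (n = 3,635) involved in COVID-19 patient treatment through CTDbase. We built a SARS-CoV-2 knowledge graph based on the interactions among virus baits, host genes, pathways, drugs, and phenotypes. A deep graph neural network approach was used to derive candidate representations based on the biological interactions. We prioritized the candidate drugs based on clinical trials, and then validated them with their genetic profiles, in vitro experimental efficacy, and electronic health records. We highlight the top 22 drugs, including azithromycin, atorvastatin, aspirin, acetaminophen, and albuterol. We further pinpointed drug combinations that may synergistically target COVID-19. In summary, we demonstrate that integrating extensive interactions, deep neural networks, and rigorous validation can facilitate the rapid identification of candidate drugs for COVID-19 treatment. This is a post-peer-review, pre-copyedit version of an article published in Scientific Reports. The final authenticated version is available online at: https://www.researchsquare.com/article/rs-114758/v1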
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
Witnessing the impressive achievements of pre-training techniques on large-scale data in the fields of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem in visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive amounts of information irrelevant to decision making, which makes predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pre-training in visuomotor driving. We aim to learn policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns the driving policy representation by predicting the future ego-motion and optimizing the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios demonstrate the superiority of our proposed approach, with improvements ranging from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
Research interest in sequential recommender systems, which aim to model dynamic sequence representations precisely, is increasing. However, the most commonly used loss functions in state-of-the-art sequential recommendation models have essential limitations. To name a few, Bayesian Personalized Ranking (BPR) loss suffers from the vanishing gradient problem caused by numerous negative samples and prediction biases; Binary Cross-Entropy (BCE) loss is subject to the number of negative samples, so it is likely to ignore valuable negative examples and reduce training efficiency; Cross-Entropy (CE) loss focuses only on the last timestamp of the training sequence, which causes low utilization of sequence information and results in inferior user sequence representations. To avoid these limitations, in this paper we propose to calculate a Cumulative Cross-Entropy (CCE) loss over the sequence. CCE is simple and direct, and enjoys the virtues of painless deployment, no negative sampling, and effective and efficient training. We conduct extensive experiments on five benchmark datasets to demonstrate the effectiveness and efficiency of CCE. The results show that employing CCE loss on three state-of-the-art models, GRU4Rec, SASRec, and S3-Rec, yields average improvements of 125.63%, 69.90%, and 33.24% in full-ranking NDCG@5, respectively. Using CCE, the performance curve of the models on the test data rises rapidly with wall-clock time and is superior to that of other loss functions throughout almost the whole training process.
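The contrast between a last-timestamp CE loss and a loss accumulated over the whole sequence can be sketched as follows. The abstract does not give the exact formula, so this assumes CCE sums the next-item cross-entropy over every position of the sequence; the function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row max for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def last_step_ce(logits, targets):
    """Cross-entropy at the final position only.

    logits:  (T, num_items) scores over the full item set at each timestep.
    targets: (T,) index of the true next item at each timestep.
    """
    return -log_softmax(logits)[-1, targets[-1]]

def cumulative_ce(logits, targets):
    """Cross-entropy accumulated over all T positions (assumed form of CCE)."""
    logp = log_softmax(logits)
    per_step = -logp[np.arange(len(targets)), targets]
    return per_step.sum()

# Uniform scores over 4 items for a sequence of length 3.
logits = np.zeros((3, 4))
targets = np.array([0, 2, 1])
ce = last_step_ce(logits, targets)
cce = cumulative_ce(logits, targets)
```

With uniform scores every position contributes log(4), so the accumulated loss is three times the last-step loss. Because the softmax runs over the full item set, no negative sampling is involved, and every position in the sequence provides a training signal.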